Dimension Encoding for Bitwise Dimensional Co-Clustering

Authors

  • Stephan Baumann
  • Peter Boncz
Abstract

In this technical report, we explain how to create skew-resistant, balanced dimensions for our clustering scheme, Bitwise Dimensional Co-Clustering (BDCC), based on histograms and Hu-Tucker encoding. This is needed to avoid unreliable precision in BDCCscan when scanning tables at different granularities.

Similar resources

Predictive Overlapping Co-Clustering

In the past few years, co-clustering has emerged as an important data mining tool for two-way data analysis. Co-clustering has several advantages over traditional one-dimensional clustering, such as the ability to find highly correlated sub-groups of rows and columns. However, one of the overlooked benefits of co-clustering is that it can be used to extract meaningful knowledge for variou...

Model-based Co-clustering for High Dimensional Sparse Data

We propose a novel model based on the von Mises-Fisher (vMF) distribution for co-clustering high-dimensional sparse matrices. While existing vMF-based models are only suitable for clustering along one dimension, our model acts simultaneously on both dimensions of a data matrix. It thereby has the advantage of exploiting the inherent duality between rows and columns. Setting our model under the m...

Clustering Algorithms For High Dimensional Data – A Survey Of Issues And Existing Approaches

Clustering is the most prominent data mining technique used for grouping data into clusters based on distance measures. With the growth of high-dimensional data such as microarray gene expression data, grouping such data into clusters runs into the problem that the similarity between objects in the full-dimensional space is often invalid because it contains different types of...

Clustering on a Subspace of Exponential Family Using Variational Bayes Method

The e-PCA has been proposed to reduce the dimension of the parameters of probability distributions using Kullback information as a distance between two distributions. It also provides a framework for dealing with various data types such as binary and integer for which the Gaussian assumption on the data distribution is inappropriate. In this paper, we introduce a latent variable model for the e...

Efficient high dimension data clustering using constraint-partitioning k-means algorithm

With the ever-increasing size of data, clustering of high-dimensional databases poses a demanding task that should satisfy the requirements of both computational efficiency and result quality. To meet both requirements, clustering of the feature space rather than the original data space has gained importance among data mining researchers. Accordingly, we performed data clustering of h...

Publication date: 2012